Parallel Programming Overview#

Course: Numerical Analysis Project

Overview#

  • Semiconductor performance has improved roughly in line with Moore's law

    • Single-core performance gains, however, have been shrinking

  • Supercomputers are built as clusters of many interconnected machines

  • Parallel computation on multi-core and many-core processors

    • e.g. training AI models on GPUs

Computer Architecture#

  • Von Neumann architecture

    • Composed of a CPU, memory, storage, network interfaces, and other devices

Fig. 14 Von Neumann architecture (From Wikipedia)#

  • CPU: composed of the ALU, the control unit (CU), and caches

    • Multi-core processors

Fig. 15 Dual Core Processor (From Wikipedia)#

  • SIMD (Single Instruction, Multiple Data): vector instructions (MMX, SSE, AVX, NEON); see the NumPy sketch at the end of this section

Fig. 16 SIMD (From Wikipedia)#

  • Memory: memory speed has improved much more slowly than processor performance

    • Types: DDR, GDDR, HBM

  • Network

    • Routers, cables, and network interface cards

    • Types: Ethernet (1G, 10G), Omni-Path, InfiniBand

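From Python, the usual way to benefit from these SIMD units is NumPy's vectorized array operations, whose compiled inner loops can be auto-vectorized by the compiler. Below is a minimal sketch (not part of the lecture code; how much SIMD is actually used depends on the NumPy build and the CPU).

import numpy as np

n = 10_000_000
a = np.random.rand(n)
b = np.random.rand(n)

def add_loop(a, b):
    # Interpreted Python loop: one element at a time, no SIMD
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

def add_vectorized(a, b):
    # A single compiled loop inside NumPy; the compiler can map it onto
    # SSE/AVX instructions where the hardware supports them
    return a + b

Timing the two (e.g. with %timeit) typically shows the vectorized version is orders of magnitude faster; part of the gap comes from avoiding the interpreter, part from vectorized instructions.
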
Parallel Programming Models#

Shared Memory Programming#

  • Multiple processes (or threads) compute in parallel while sharing data through a common shared memory

  • Libraries: OpenMP, POSIX threads (pthreads), Intel TBB

  • Fork and Join Model

Fig. 17 Fork and Join model (From Wikipedia)#

  • Beware of race conditions (a short sketch follows below)

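The sketch below (plain Python threading, not part of the course code) shows why: several threads increment a shared counter, and the read-modify-write on the counter is not atomic, so updates can be lost unless a lock serializes them. Whether the loss actually shows up in a given run depends on the interpreter and timing.

import threading

counter = 0
lock = threading.Lock()

def work_unsafe(n):
    global counter
    for _ in range(n):
        counter += 1          # read-modify-write on shared data: not atomic

def work_safe(n):
    global counter
    for _ in range(n):
        with lock:            # the lock serializes the update
            counter += 1

threads = [threading.Thread(target=work_unsafe, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # may be less than 400000 if updates were lost;
                # using work_safe instead always gives 400000
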
Message Passing Model#

  • Each process has its own private memory and exchanges data through explicit communication

  • Libraries: MPI implementations (MPICH, Open MPI, MS-MPI, Intel MPI)

Fig. 18 Message Passing model (From KSC)#

  • Beware of deadlocks (a short sketch follows below)

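A minimal mpi4py sketch of point-to-point communication (the script name and payload are only placeholders; run with something like mpiexec -n 2 python ping.py). Note the ordering: rank 0 sends before it receives and rank 1 receives before it sends. If both ranks called the blocking recv first, each would wait for the other forever, which is exactly the deadlock mentioned above.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send([1.0, 2.0, 3.0], dest=1, tag=0)   # send first ...
    reply = comm.recv(source=1, tag=1)          # ... then wait for the reply
    print('rank 0 got', reply)
elif rank == 1:
    data = comm.recv(source=0, tag=0)           # matching receive
    comm.send('ack', dest=0, tag=1)
    print('rank 1 got', data)
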
Parallel Performance#

Amdahl’s law#

If a fraction \(p\) of the code is parallelized and that portion is sped up by a factor of \(N\), the overall speedup \(S\) is:

\[ S = \frac{1}{(1-p) + \frac{p}{N}} \]

Fig. 19 Speedup comparison (From Wikipedia)#

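A quick numerical illustration of the formula: even with 95 % of the code parallelized, the speedup stays below 20 no matter how large N becomes.

def amdahl_speedup(p, n):
    """Overall speedup when a fraction p of the work is sped up by a factor n."""
    return 1.0 / ((1.0 - p) + p / n)

for n in [2, 4, 8, 16, 64, 1024]:
    row = ", ".join(f"p={p:.2f}: S={amdahl_speedup(p, n):6.2f}" for p in (0.50, 0.90, 0.95))
    print(f"N = {n:4d} | {row}")
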
Parallel Programming in Python#

Numba#

  • Provides loop parallelization with prange

  • Supports multiple threading layers, e.g. OpenMP and Intel TBB

mpi4py#

  • Python bindings for MPI libraries

Example#

Parallelize the Laplace solver below using the fork-and-join model.

import numba as nb
import numpy as np

# Use OpenMP
from numba import config
config.THREADING_LAYER = 'omp'

# For Intel MKL as the BLAS and LAPACK backend
import mkl


def solve_laplace(n, solver, tol=1e-5, order='C'):
    """
    Laplace Equation solver
    
    Parameters
    ----------
    n : integer
        size
    solver : function
        iterative solver
    tol : float
        tolerance
    order : string
        'C' | 'F'
        
    Returns
    -------
    err : float
        residual
    """
    ti = np.zeros((n+2, n+2), order=order)
    dt = np.zeros((n+2, n+2), order=order)

    def bc(t):
        t[-1, 1:-1] = 300
        t[0, 1:-1] = 100
        t[1:-1, -1] = 100
        t[1:-1, 0] = 100

    err = 1
    while err > tol:
        # Apply BC
        bc(ti)

        # Run one iteration of the solver (Jacobi in this example)
        solver(n, ti, dt)

        # Compute Error
        err = np.linalg.norm(dt) / n
        
    return err


@nb.njit(fastmath=True)
def jacobi_nb(n, ti, dt):
    """
    Jacobi method
    
    Parameters
    ----------
    n : integer
        size
    ti : float
        current time
    dt : array
        difference
    """
    for i in range(1, n+1):
        for j in range(1, n+1):
            dt[i, j] = 0.25*(ti[i-1, j] + ti[i, j-1] + ti[i+1, j] + ti[i, j+1]) - ti[i, j]
            
    # Update
    ti += dt


@nb.njit(fastmath=True, parallel=True)
def jacobi_nbp(n, ti, dt):
    """
    Jacobi method
    
    Parameters
    ----------
    n : integer
        size
    ti : float
        current time
    dt : array
        difference
    """
    for i in nb.prange(1, n+1):
        for j in range(1, n+1):
            dt[i, j] = 0.25*(ti[i-1, j] + ti[i, j-1] + ti[i+1, j] + ti[i, j+1]) - ti[i, j]
            
    # Update
    for i in nb.prange(n+2):
        for j in range(n+2):
            ti[i, j] += dt[i, j]

n = 2048
%time solve_laplace(n, jacobi_nb, tol=5e-3)
CPU times: user 6min 3s, sys: 5.83 s, total: 6min 8s
Wall time: 23.4 s
0.004998731199856652
# At AMD Threadripper 5955WX (16C/32T)
for i in [1, 2, 4, 8, 16, 32]:
    # Adjust number of threads for numba and MKL
    nb.set_num_threads(i)
    mkl.set_num_threads(i)
    print("Number of Threads :", i)
    
    # Measure time
    %time solve_laplace(n, jacobi_nbp, tol=5e-3)
Number of Threads : 1
CPU times: user 27.3 s, sys: 79.4 ms, total: 27.4 s
Wall time: 24.6 s
Number of Threads : 2
CPU times: user 22.9 s, sys: 32 ms, total: 23 s
Wall time: 11.5 s
Number of Threads : 4
CPU times: user 23.8 s, sys: 60 ms, total: 23.9 s
Wall time: 5.97 s
Number of Threads : 8
CPU times: user 26.2 s, sys: 92.1 ms, total: 26.3 s
Wall time: 3.29 s
Number of Threads : 16
CPU times: user 33.3 s, sys: 192 ms, total: 33.5 s
Wall time: 2.09 s
Number of Threads : 32
CPU times: user 2min 20s, sys: 1.87 s, total: 2min 22s
Wall time: 4.48 s
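
As a rough follow-up, the wall times above can be converted into speedup and parallel efficiency (assuming the 1-thread jacobi_nbp run, 24.6 s, as the baseline):

# Speedup and efficiency from the wall times measured above
# (assumption: the 1-thread jacobi_nbp run, 24.6 s, is the baseline)
base = 24.6
wall = {1: 24.6, 2: 11.5, 4: 5.97, 8: 3.29, 16: 2.09, 32: 4.48}

for threads, t in wall.items():
    speedup = base / t
    efficiency = speedup / threads
    print(f"{threads:2d} threads: speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")

Efficiency stays high up to 8 threads and drops at 16 (values slightly above 1 reflect measurement noise or cache effects). The slowdown at 32 threads is consistent with oversubscribing the 16 physical cores via SMT and hitting memory-bandwidth limits, though other causes are possible.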